The Slovene BNSI Broadcast News database and reference speech corpus GOS: Towards the uniform guidelines for future work
Abstract
The aim of this paper is to identify common guidelines for the future development of speech databases for less-resourced languages, so that they are as useful as possible in both of their main fields of use: linguistic research and speech technologies. We compare two standards for creating speech databases: one followed in developing the Slovene speech database for automatic speech recognition, BNSI Broadcast News, and the other followed in developing the Slovene reference speech corpus GOS. We then outline possible common guidelines for future work. We also present an add-on for the GOS corpus that enables its use for automatic speech recognition.
Similar resources
BNSI Slovenian broadcast news database - speech and text corpus
This paper presents the BNSI Slovenian Broadcast News database project. The result of the project is a database with speech and text corpus oriented toward large vocabulary continuous speech recognition in general domain. The speech corpus consists of 36 hours of transcribed evening and late night news. The raw database material was captured in the archive of national broadcaster RTV Slovenia t...
SINOD - Slovenian non-native speech database
This paper presents the SINOD database, which is the first Slovenian non-native speech database. It will be used to improve the performance of large vocabulary continuous speech recogniser for non-native speakers. The main quality impact is expected for acoustic models and recogniser’s vocabulary. The SINOD database is designed as supplement to the Slovenian BNSI Broadcast News database. The sa...
The goo300k corpus of historical Slovene
The paper presents a gold-standard reference corpus of historical Slovene containing 1,000 sampled pages from over 80 texts, which were, for the most part, written between 1750 – 1900. Each page of the transcription has an associated facsimile and the words in the texts have been manually annotated with their modern-day equivalent, lemma and part-of-speech. The paper presents the structure of t...
RUNDKAST: an Annotated Norwegian Broadcast News Speech Corpus
This paper describes the Norwegian broadcast news speech corpus RUNDKAST. The corpus contains recordings of approximately 77 hours of broadcast news shows from the Norwegian broadcasting company NRK. The corpus covers both read and spontaneous speech as well as spontaneous dialogues and multipart discussions, including frequent occurrences of non-speech material (e.g. music, jingles). The recor...
TUKE-BNews-SK: Slovak Broadcast News Corpus Construction and Evaluation
This article presents an overview of the existing acoustical corpuses suitable for broadcast news automatic transcription task in the Slovak language. The TUKE-BNews-SK database created in our department was built to support the application development for automatic broadcast news processing and spontaneous speech recognition of the Slovak language. The audio corpus is composed of 479 Slovak TV...